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Abstract 



t~^ ' This paper deals with the Bayesian analysis of graphical models of marginal in- 

o 

~^ • dependence for three way contingency tables. We use a marginal log-linear parame- 

trization, under which the model is defined through suitable zero-constraints on the 

- 1 — I . 

X 

5_( ' interaction parameters calculated within marginal distributions. We undertake a com- 

prehensive Bayesian analysis of these models, involving suitable choices of prior distri- 
butions, estimation, model determination, as well as the allied computational issues. 
The methodology is illustrated with reference to two real data sets. 
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1 Introduction 

Graphical models of marginal independence were originally introduced by Cox and Wer- 
muth (1993) for the analysis of multivariate Gaussian distributions. They compose a 
family of multivariate distributions incorporating the marginal independences represented 
by a bidirected graph. The nodes in the graph correspond to a set of random variables 
and the bidirected edges represent the pairwise associations between them. A missing 
edge from a pair of nodes indicates that the corresponding variables are marginally in- 
dependent. The complete list of marginal independences implied by a bidirected graph 
were studied by Kauermann (1996) and by Richardson (2003) using the so-called Markov 
properties. 

The analysis of the Gaussian case can be easily performed both in classical and 
Bayesian frameworks since marginal independences correspond to zero constraints in the 
variance-covariance matrix. The situation is more complicated in the discrete case, where 
marginal independences correspond to non linear constraints on the set of parameters. 
Only recently parameterizations for these models have been proposed by Lupparelli (2006), 
Lupparelli et al. (2008) and Drton and Richardson (2008). In this paper we use the pa- 
rameterization proposed by Lupparelli (2006) and Lupparelli et al. (2008) based on the 
class of marginal log-linear models of Bergsma and Rudas (2002). Each log-linear param- 
eter is calculated within the appropriate marginal distribution and a graphical model of 
marginal independence is defined by zero constraints on specific higher order log-linear 
parameters. Alternative parameterizations have been proposed by Drton and Richard- 
son (2008) based on the Moebius inversion and by Lupparelli et al. (2008) based on 
multivariate logistic representation of the models of Glonek and McCullagh (1995). 

We present a comprehensive Bayesian analysis of discrete graphical models of mar- 



ginal independence, involving suitable choices of prior distributions, estimation, model 
determination as well as the allied computational issues. Here we focus on the three way 
case where the joint probability of each model under consideration can be appropriately 
factorized. We work directly in terms of the vector of joint probabilities on which we impose 
the constraints implied by the graph. Then we consider a minimal set of probability 
parameters expressing marginal/conditional independences and sufficiently describe the 
graphical model of interest. We introduce a conjugate prior distribution based on Dirichlet 
priors on the appropriate probability parameters. The prior distribution factorize similarly 
to the likelihood. In order to make the prior distributions 'compatible' across models we 
define all probability parameters (marginal and conditional ones) of each model from 
the parameters of the joint distribution of the full table. In order to specify the prior 
parameters of the Dirichlet prior distribution, we adopt ideas based on the power prior 
approach of Ibrahim and Chen (2000) and Chen et al. (2000). 

The plan of the paper is as follows. In Section 2 we introduce graphical models of 
marginal independence, we establish the notation, we present Markov properties and we 
explain in detail their log-linear parameterization. Section 3 illustrates a suitable factor- 
ization of the likelihood function for all models of marginal independence in three-way 
tables. In Section 4, we consider conjugate prior distributions, we present an imaginary 
data approach for prior specification and we compare alternative prior set-ups. Section 5 
provides posterior model and parameter distributions which can be easily calculated via 
conjugate analysis. Two illustrative examples are presented in Section 6. Finally, we end 
up with a discussion and some final comments regarding our current research on the topic. 



2 Preliminaries 

2.1 Graphical Models of Marginal Independence 

A bidirected graph G = (V, E) is characterized by a vertex set V and an edge set E with 
the property that {vi,Vj) € -B if and only if {vj,Vi) € E. We denote each bidirected edge 
by {hi , Vj ) = {{vi,Vj), {vj,Vi)^ and we represent it with a bidirected arrow. Ifavertexuj is 
adjacent to another vertex Vj, then Vj is said to be spouse of Vi, and we write Vj € sp{vi). 
For a set ^ C V, we define sp{A) = yj{sp{v)\v € ^). The degree d{v) of a vertex v is 
the cardinahty of the spouse set. A path connecting two vertices, vq and fm, is a finite 
sequence of distinct vertices wq, • • • fm such that (uj_i, f j), i = 1, . . . , m, is an edge of the 
graph. A vertex set C C V is connected if every two vertexes Vi and Vj are joined by a 
path in which every vertex is in C. Two sets A,B ^V are separated by a third set (S* G V 
if any path from a vertex in A to a vertex in B contains a vertex in S. It can be shown 
that, if a subset of the nodes D is not connected then there exist a unique partition of it 
into maximal (with respect to inclusion) connected set Ci , . . . , C^ 

D = Ci u C2 u . . . u a. (1) 

The graph is used to represent marginal independences between a set of discrete 
random variables Xy = (X^, f € V), each one taking values i^ G X„. The cross- 
tabulation of variables Xy produces a contingency table of dimension |V| with cell fre- 
quencies n = \n{i), i ^ Z\ where Z = x^^-^T^. Similarly for any marginal M C V, we 
denote with Xm = {X^, v € M) the set of variables which produce the marginal table 
with frequencies njv/ = (nA/(iAf), im £ Zm] where Zm = XveM^v The marginal cell 
counts are the sum of specific elements of the full table and are given by 



nuiiM) = ^ n{j) 

where M = V \ M and Ijju = {j & I : jm = hi}- Therefore Ij^u refers to all cells 
of the full table for which the variables of the M marginal are constrained to the specific 
value iM- 

A graphical model of marginal independence is constructed via the following Markov 
properties. 

Definition 1 ; Connected Set Markov Property (Richardson, 2003). The distribution of 
a random vector X\; is said to satisfy the connected set Markov property if 

XcMXy\(^CVJsp{C)) (2) 

whenever % ^ C '^V is a connected set. 

A more exhaustive Markov property is the global Markov property, which requires all 
the marginal independences in ([2]), but also additional conditional independences. 

Definition 2 .• Global Markov Property (Kauermann, 1996 and Richardson, 2003). The 
distribution of a random vector Xy = {Xy,v e V} satisfies the Global Markov property if 

A is separated from B byV\{AuBuC) in G implies Xa^Xb\Xc, (3) 

with A, B and C disjoint subsets of V , and C may be empty. 

Despite the global Markov property is more exhaustive (in the sense that indicates both 
marginal and conditional independences), Drton and Richardson (2008) pointed out that 
a distribution satisfies the global Markov property if and only if it satisfies the connected 
set Markov property. 



^From the global Markov property, we directly derive that if two nodes i and j are 
disconnected, then XiALXj that is the variables are marginal independent. The same is 
true for any two sets A C V and B C V that are disconnected, implying that A ALB (are 
marginal independent). This can be easily generalized for any given disconnected set D 
satisfying ([1]). Then the global Markov property for the bidirected graph G implies 

Xc,ALXc,M...MXc,. 

According to Drton and Richardson (2008), a discrete marginal graphical model, as- 
sociated to a bidirected graph G, is a family P{G) of joint distributions for a categorical 
random vector Xy satisfying the global Markov property (or equivalently the connected 
set Markov property). Following the above, for every not connected set Z? C V, it holds 
that 

r 

P{XD = iD) = \{P{Xc,=ic,) (4) 

fc=l 

where Ci , . . . , C,- are the inclusion maximal connected sets satisfying ([1]) . 

2.2 A Parameterization for Marginal Log-Linear Models 

Lupparelli (2006) and Lupparelli et al. (2008) show that it is possible to define a param- 
eterization for any set Xy of categorical variables, by using the marginal log- linear model 
by Bersgma and Rudas (2002). 

Bergsma and Rudas (2002) suggested to work in terms of log-linear parameters A 
obtained from a specific set of marginal tables. They consider the following model 

A = Clog(Mvec(7r)) (5) 



where tt = ( '/r(i), i E T) is the joint probability distribution of Xy and vec(7r) is a vector of 
dimension \I\ obtained by rearranging the elements tt in a reverse lexicographical ordering 
of the corresponding variable levels with the level of the first variable changing first (or 



faster). For example in a 2 x 2 table the vector of probabilities will be given by vec(7r) = 
7r(l, l),7r(2, l),7r(l, 2),7r(2,2) I . In this paper we assume that the parameter vector 



A satisfies sum-to-zero constraints and we indicate with C the corresponding contrast 
matrix. Finally M is the marginalization matrix which specifies from which marginal we 
calculate each element of A. An algorithm for constructing C and M matrices is given in 
the Appendix ( for additional details see Appendix A in Lupparelli, 2006). 

2.2.1 Properties of Marginal Log-Linear Parameters 

Let M C V be a generic marginal, and indicate with S{M) the class of all subsets of M 
and with £m G S{M) the set of effects obtained from marginal M. 

Given A1 = {-/Vfi,M2, . . . ,M\j^\\ the set of marginals used to calculate the log-linear 
parameters A, we denote by Ag = ( Ag (ig), ig G Tg, M G A4 J , the set of parameters for 
effect e C M estimated by the marginal M and by A the set of all parameters estimated 
by the same marginal. 

According to Bersgma and Rudas (2002), in order to obtain a well-defined param- 
eterization, it is important to allocate the interaction parameters A among the chosen 
marginals to get a complete and hierarchical set of parameters. 

Definition 3 .■ Complete and hierarchical set of parameters. A set of marginal log-linear 
parameters A is called hierarchical and complete if: 

i) The elements of M are ordered in a non decreasing order, which means that no 
marginal is a subset of any preceding one, i.e. Mi ^ Mj if i > j (hierarchical 
ordering). 

Furthermore, we require the last marginal in the sequence to be the set all vertices 
under consideration, therefore M\j^i = V . 



ii) From each marginal M^ we can calculate only the effects that was not possible to get 
from the ones preceding it in the ordering under consideration; hence the sets £Mi of 
the effects under consideration are given by 

£m, = S{Mi) and £m^ = S{M,) \ {u^-\5(Mfc)} for i = I, . . . ,\M\ . 

The above set of parameters define a paranietrization of the distribution on the contin- 
gency table (see Bergsma and Rudas, 2002, for a formal definition). It is called complete 
since each parameter is estimated from one and only one marginal and hierarchical be- 
cause the full set of parameters is generated by marginals in a non decreasing ordering 
(Bersgma and Rudas, 2002, p 143-144). 

For every complete and hierarchical set of parameters, the inverse transformation of (0) 
always exists but it cannot be analytically calculated (Lupparelli, 2006, p. 39). Iterative 
procedures have been used to calculate tt for any given values of A and hence the likelihood 
of the graph under consideration (Rudas and Bergsma, 2004 and Lupparelli, 2006). 

An important problem when working with marginal log-linear parameters A is that 
we may end up with the definition of non-existing joint marginal probabilities. A way to 
avoid this is to consider the parameters' variation independence property; see Bergsma 
and Rudas (2002). A set of parameters is variation independent when the range of pos- 
sible values of one of them does not depend on the other's value. Hence the joint range 
of the parameters is the Cartesian product of the separate ranges of the parameters in- 
volved. This property ensures the existence of a common joint distribution deriving from 
the marginals of the model under consideration. Moreover, variation independence ensures 
strong compatibility which implies both compatibility of the marginals and existence of 
a common joint distribution. Compatibility of the marginals means that from different 
distributions we will end up to the same distribution for the common parameters, for 



example from marginals AB and AC we get the same marginal for A. To assure varia- 
tion independence of hierarchical log-linear parameters, a generalization of the classical 
decomposability concept is needed. 

Definition 4 ; Decomposable set of Marginals. A class of incomparable (with respect to 
inclusion) marginals M. is called decomposable if it has at most two elements (\M^\ < 2) 
or if there is an ordering Mi, . . . , M\j^i of its elements such that, for k = 3, . . . , \M\ there 
exist at least one jk < k for which the running intersection property is satisfied, that is 

(u^i^Mi) n Mfc = Mj^ n Mk . 

Definition 5 ; Ordered decomposable set of marginals. A class of marginals Ai is or- 
dered decomposable if it has at most two elements or if there is a hierarchical ordering 
Ml, . . . , M|_vi| of the marginals and, for k = 3, . . . , \A4\, the maximal elements (in terms 
of inclusion) of {Mi, . . . , M^} form a decomposable set. 

The importance of the above property is due to the theorem 4 of Bergsma and Rudas 
(2002) where they proved that a set of complete and hierarchical marginal log-linear pa- 
rameters is variation independent if and only if the ordering of the marginals involved 
is ordered decomposable. Hence order decomposability ensures the existence of a well 
defined joint probability. 

2.3 Construction of Marginal Log-Linear Graphical Models 

Based on the results of the previous subsection, Lupparelli (2006) and Lupparelli et al. 
(2008) proposed a strategy to construct a marginal log-linear parametrization for the 
family of discrete bidirected graphical. 

Initially we need to consider a set of a parameters A derived from a hierarchical ordering 
of the marginals in I'(G') U V; where T>{G) is the set of all disconnected components of the 



bidirected graph G. Then, we must set the highest order log-hnear interaction parameters 
of T>{G) equal to zero. 

More precisely, we need to consider the following steps: 

i) Construct a hierarchical ordering of the marginals Mj G T)[G). 

ii) Append the marginal M = V (corresponding to the full table under consideration) 
at the end of the list if it is not already included. 

iii) For every marginal table Mi € 'D{G) U V estimate all parameters of effects in Mi 
that have not already estimated from the preceding marginals. 

iv) For every marginal table Mi G 'D{G), set the highest order log- linear interaction 
parameter equal to zero; see Proposition 4.3.1 in Lupparelli (2006). 

In the three way case the log-linear parameters the marginals are always obtained 
from a set of order of order decomposable marginals, hence the parameters are variation 
independent. 

3 Likelihood Decomposition 

In this paper we propose to use a different approach from the one by Rudas and Bergsma 
(2004) and Lupparelli (2006) in order to estimate the joint distribution tt of a graph G. We 
impose the constraints implied by the graph G directly on the joint probabilities tt. We 
work with a minimal set of probability parameters tt*-^ expressing marginal/conditional 
independences and sufficiently describe the graphical model G under investigation. By 
this way we can always reconstruct the joint distribution tt for a given graph G via tt'^ 
and then simply calculate the marginal log- linear parameters directly using ^. Here we 



10 



focus on the three way case where the jomt probabihty of each model can be appropriately 
factorized for any graph G. 

For every three way contingency table eight possible graphical models models exist 
which can be represented by four different types of graphs: the independence, the satu- 
rated, the edge and the gamma structure graph (see Figured]). The independence graph is 
the one with the empty edge set [E = %), the saturated is the one containing all possible 
edges E = iy{v,w) : V ^ w ^V) , an edge graph is the one having only one single edge 
and a gamma structure graph is one represented by a single path of length two. In a three 
way table, three 'edge' and three 'gamma' graphs are available. 



B 



B 



B 



B 



A AAA 

(a) Independence Model (b) Saturated Model (c) Edge Model (d) Gamma Model 
Figure 1: Type of Graphs in Three Way Tables 



For the saturated model Gs, we get all parameters from the full table, i.e. tt ■^ = vr. 
Thus, the likelihood is directly written as 



/(n|7r,Gs) 



r(iv + i) 



n r(n(z) + l),ex 



n 



TT U 



,n(i) 



where A^ = X]jgjn(i) is the total sample size. 

The joint distributions for the independence and the edge models can be easily ex- 
pressed using the equation 



n r(n(f) + l) d&v{G) idSJd 
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where tt = (tt^, d G 'D{G)j and tt^ = iTTdiid), id ^ ^d) is the vector of probabihties 

corresponding to the table of the marginal d. The above equation derives directly from 

([1]) since the graph is disconnected. 

Finally, the decomposition of the gamma structures is not as straightforward as in the 
previous case. In addition to specific marginal probability parameters we also need to 
use some conditional ones. Let us denote by "c" the corner node, that is the vertex with 
degree 2 and by c = V \ c the end-points of the path. The set of disconnected marginals 
is equal to the end-point vertices of the graph, hence c = 'D{G). Then the likelihood can 
be written as 






/(nk^G)^ ^7+^^ . n n-^i^^M% 



iex 



n n ^SdT^''^ 

d£V{G) id&Id 



where tv = ( tt^Ic, tt^, d € '^{G) j , here ^{G) = V \ c and 



are all the conditional probabilities of c given c. 

The above factorization can be easily adopted to get maximum likelihood estimates 
analytically and avoid the iterative procedure used by Rudas and Bersgma (2004) and Lup- 
parelli (2006). In this paper we work using conjugate priors on the appropriate probability 
parameters of the above parametrization and then calculate the corresponding log-linear 
parameters. 

4 Prior distributions on cell probabilities 

4.1 Conjugate Priors 

For the specification of the prior distribution of the probability parameter vector we ini- 
tially consider a Dirichlet distribution with parameters a = {a{i), i € X ) for the vector 
of the joint probabilities of the full table tt = (7r(i), i G l), where 2 is the set of all cells 
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of the table under consideration. Hence, for the fuh table the prior density is given by 

/(tt) = ^^f n ^(^)"^^^"' = M-^-^ «) (6) 

n r[a{i)) igj 

where fviij^] oc) is the density function of the Dirichlet distribution evaluated at tt with 
parameters a and a = Yliei'^i'^)- 

Under this set-up, the marginal prior of 7r(i) is a Beta distribution with parameters 
a{i) and a — a{i), i.e. 7r[i) ~ Betaia{i), a — a{i) 1 . The prior mean and variance of each 

cell is given by 

j^r ,..1 a{i) r . a{i){a-a{i)} 
E[tt{i)\ = and V[tt{i)\ = . 

When no prior information is available then we usually set all a{i) = t^ resulting to 

Small values of a increase the variance of each cell probability parameter. Usual choices 
for a are the values \I\/2 (Jeffrey's prior), \T\ and 1 (corresponding to a{i) equal to 1/2, 1 
and 1/\I\ respectively); for details see Dellaportas and Forster (1999). The choice of this 
prior parameter value is of prominent importance for the model comparison due to the well 
known sensitivity of the posterior model odds and the Bartlett-Lindley paradox (Lindley, 
1957, Bartlett, 1957). Here this effect is not so adverse, as for example in usual variable 
selection for generalized linear models, for two reasons. Firstly even if we consider the 
limiting case where a{i) = -^ with a ^ 0, the variance is finite and equal to (|X| — l)/|Zp. 
Secondly, the distributions of all models are constructed from a common distribution of 
the full model/table making the prior distributions 'compatible' across different models 
(Dawid and Lauritzen, 2000 and Roverato and Consonni, 2004). 

The model specific prior distributions are defined by the constraints imposed by the 
model's graphical structure and the adopted factorization. The prior distribution also 
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factorizes in same manner as the likelihood described in section [3l Thus, the prior for the 
saturated model is the usual Dirichlet ([6]). 

For the independence and edge models the prior is given by 



/(-^G) = n 

deV{G) 



r(a) 



n r(Q(id)) i,gj. 



n 



TTdUd 



|0(«d)-l 



deV{G) 



with parameter vector tv = (vTrf, d G ^(G)) . We denote the above density which is 
simple product (over all disconnected sets) of Dirichlet distributions by 



,G\ 



f{'^n=fvvUd; ad,dGV{G) 



(7) 



For the gamma structure the prior is given by 

r(a(ic) 



p(-^) = n 



eic 



LC '^-i-c 



n T(a{ic,ic 



n ^cic(' 



^C) ^c. 



,a{ic,i-)-l 



(8) 



v*ceic 



fvvi^d] ad, de V{G) 



with parameter vector -k'-' [tt f.^-^, tt^, d G D(G)). The fist part of equation ([S]), that is the 
product for all level of c of Dirichlet distributions of the conditional probabilities, can be 
denoted by fc'Pv{j^c\c'i <^\ Then, the prior density dH) can be written as 



/(tt ) = fcvv{T^c\c; OL)fvv{'^d; ad,deV{G)) . 



(9) 



In order to make the prior distributions 'compatible' across models, we define the prior 
parameters of tv^ from the corresponding parameters of the prior distribution ([S]) imposed 
on the probabilities vr of the full table; see Dawid and Lauritzen (2000), Roverato and 
Consonni (2004). 

Let us consider a marginal M G A4{G) for which we wish to estimate the probability 
parameters tvm = {'^M{iM),Hi € 2m)- The resulting prior is tvm ~ 2>i[aM), that is a 
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Dirichlet distribution with parameters ctM = (aM(Hf)i*M G ^m) given by 

aniiM) = ^ a(j), 

see (i) of Lemma 7.2 in Dawid and Lauritzen (1993, p. 1304). 

For example, consider a three way table with V = {A, B, C} and the marginal M = C. 

Then the prior imposed on the parameters tvc of the marginal C is given by 

\Ia\ \Tb\ 
nc^Vi^ac) with ac{ic) = '^ y^ a{iA,iB,ic) for ic = 1,2, ... ,\Ic\, 



where tvc = (^7rc(l), . . . ,ttc{\Ic\)J- 

For the conditional distribution of Mi\M2 with Mi 7^ M2 G A^(G) we work in a 
similar way. The vector T^Mj^iM^i-lhh) = {'^Mi\M2{^Mi\iM2),'i'Mi G ^A/i) a priori follows a 
Dirichlet distribution 



■A/2 

The above structure derives from the decomposition of a Dirichlet as a ratio of Gamma 
distributions; see also Lemma 7.2 (ii) in Dawid and Lauritzen (1993, p. 1304). 

For example, consider marginals Mi = A and M2 = B in a. three way contingency 
table with V = {A, B, C}. Then, for a specific level of variable B, say is = 2, 

'^A\B{-\'iB = 2) ~ Vi{aAB{-, 2)) 

where aAB(-,2) = (^0^5(1, 2),aAB(2, 2), . . . ,aAB(|2^A|, 2)j and 

\lc\ 
aAsiiA,'^) = ^ aASc(u,2,ic)- 

4.2 Specification of Prior Parameters Using Imaginary Data. 

In order to specify the prior parameters of the Dirichlet prior distribution, we adopt ideas 
based on the power prior approach of Ibrahim and Chen (2000) and Chen et al. (2000). 

15 



We use their approach to advocate sensible values for the Dirichlet prior parameters on 
the full table and the corresponding induced values for the rest of the graphs as described 
in the previous sub-section. Let us consider imaginary set of data represented by the 
frequency table n* = {n*{i),i S 2) of total sample size A^* = X]jgx^*(^) and a Dirichlet 
'pre-prior' with all parameters equal to oq- Then the unnormalized prior distribution can 
be obtained by the product of the likelihood of n* raised to a power w multiplied by the 
'pre-prior' distribution. Hence 

/(tt) oc /(n*|7r)'^ x /©^(Tr; a{i) =ao,ie l) 

= /x)i(7r; a(i) ='u;n*(i) + ao,i e J) . (10) 

Using the above prior set up, we expect a priori to observe a total number of w N* + 
|Z|ao observations. The parameter w is used to specify the steepness of the prior dis- 
tribution and the weight of belief on each prior observation. For w = 1 then each 
imaginary observation has the same weight as the actual observations. Values of w < 
1 will give less weight to each imaginary observation while u; > 1 will increase the 
weight of believe on the prior/imaginary data. Overall the prior will account for the 
{w N* + \I\ao)/{w N* -|- A^ -|- |I|ao) of the total information used in the posterior distri- 
bution. Hence for w = 1, N* = N and ao — > then both the prior and data will account 
for 50% of the information used in the posterior. 

For w = 1/N* then a{i) = p*{i) + oq with p*{i) = n*{i)/N* , the prior data n* will 
account for information of one data point while the total weight of the prior will be equal 
to (1 -|- |X|ao)/(l -|- A^ -|- |2r|Q!o). If we further set oq = 0, then the prior distribution 
pop will account for information equivalent to a single observation. This prior set-up will 
be referred in this paper as the unit information prior (UIP). When no information is 
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available, then we may further consider the choice of equal cell frequencies n*{i) = n* for 
the imaginary data in order to support the simplest possible model under consideration. 
Under this approach A^* = n* x \I\ and w = 1/N* = ,^ir\ resulting to 

The latter prior is equivalent to the one advocated by Perks (1947). It has the nice 
property that the prior on the marginal parameters does not depend on the size of the 
table; for example, for a binary variable, this prior will assign a Beta{l/2, 1/2) prior on 
the corresponding marginal regardless the size of the table we work with (for example if 
we work with 2^ or 2x4x5x4 table). This property is retained for any prior distribution 
of type ([10]) with w* = 1/N*, p*{i) = l/\l\ and ckq oc 1/|T|. 

4.3 Comparison of Prior Set-ups 

Since Perks' prior (with a{i) = l/|2r|) has a unit information interpretation, it can be used 

as a yardstick in order to identify and interpret the effect of any other prior distribution 

used. Prior distribution with a{i) < 1/\I\, or equivalently a < 1, results in larger variance 

than the one imposed by our proposed unit information prior and hence it a posteriori 

supports more parsimonious models. On the contrary, prior distributions with a{i) > 

1/|T|, or a > 1, result in lower prior variance and hence it a posteriori support models 

with more complicated graph structure. So the variance ratio between a Dirichlet prior 

with a{i) = ct/\I\ and Perks prior is equal to 

V{7r,\ a{i) = ^) 2 

~ V{7ri\ a{i) = |J|-i) ~ a + 1 ' 

A comparison of the information used from some standard choices is provided in Table 
[TJ Prom this Table, we observe that Jeffreys' prior variance is lower than the corresponding 
Perks' prior reaching a reduction of about 60% and 85% for a 2^ and a 2 x 3 x 4 table 
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Table 1: Table of Prior Variance in Comparison to the Unit Information Prior (last row of the table) 



Variance Ratio 



Parameter 



V[Ai)] 



General Equation 



2x2x2 



3x2x4 



Jeffrey's a 

Unit Exp. Cell a 

EBP a 

UIP a 

PP with to = l/N* a 

UlP-Perks' a 



1/2 



9 1^1 -1 

^ |I|^{|Il+2} 



^1x1+2; 
2 



0.4 



0.22 



0.15 



0.08 



-.p{i) lp(j)(l_p(j)) ^v[-k{i)] 9.14 X (y[7r(j)]) 25.04 X (y[7r(j)]) 

-■V*(i) ip'{«)(l-P*«) l^P*W(l-P*W) 9.14xp*(i)(l-p*«) 25.04 xp*{i)(l-p*(.0) 



:p*(i) + QO S(p*(i),ao,|/|l:. |^5(p*(i),ao,|/|) 18.28 xS(p*(i),Q0,8) 50.08 x 5(p'(i), ao, 24) 

a/|J| i±J^ 1 1 1 



2|IP 



act */-\ \r\\ (p*{«) + ao) (l+l-f|ao-P*(0-0!o) 



respectively. The reduction is even greater for the prior of the Unit Expected Cell mean 
{a{i) = 1) reaching 78% and 92% respectively. 

Finally, we use for comparison an Empirical Bayes prior based on the UIP approach. 
Hence we set the imaginary data n*{i) = n{i), w = 1/N and oq = 0. Then the resulting 
prior parameters are given by a{i) = p{i), where p{i) = n[i)/N is the sample proportion. 
Under this set-up, the prior variance for each 7r(i) is equal to V^[vr(i)] = ip(i)(l —p{i)). 
Thus the above prior assumes that we have imaginary data with the same frequency table 
as the observed one but they accounts for information equal to one data point (Empirical 
UIP). 

5 Posterior Model and Parameter Distributions 

Since the prior is conjugate to the likelihood the posterior can be derived easily as fol- 
lows. For the saturated model the posterior distribution is also a Dirichlet distribution 
7r|n, G5 ~ Di(5) with parameters 

5 = (5(i) = a{i) + n{i), i € I^ 

For the independence and the edge structure the density of the posterior distribution is is 
equivalent to d?]), /(7r*^|n,G) = f-pvij^^', ol ) with 

a.^ = ( 5^, d G 'D{G)] and 5^ = {ad{id) = Oid{id) + nd{id), id e Id 



Finally, for the gamma structure /(7r'^|n,G) = fc'Pv{''^c\cj f^) ^ fvv{'^d] otd, d € 'D{G)) 
i.e. a distribution with density equivalent to the corresponding prior ^ with parameters 

5'^ = (a, ctd, de V{G)y 

^From the properties of the Dirichlet distribution, it derives that each element of tt*^ 
follows a Beta distribution with the appropriate parameters. 
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For model choice we need to estimate the posterior model probabilities f{G\n) oc 
f{n\G)f{G), with f{n\G) marginal likelihood of the model and /(G) prior distribution 
on G. Here we restrict to the simple case where f{G) is uniform, hence the posterior will 
depend only on the marginal likelihood f{n\G) of the model under consideration. The 
marginal likelihood can be calculated analytically since the above prior set-up is conjugate. 

For the saturated model the marginal likelihood is given by 

DK(a) 

/ n G = K{n) x -—^ 

where K{n) and DK{ol) are given by 

r(iV + l) \tj 

K{n) = -, r- and UK [a.) — 



n r n(i) + i nr «(^) 

iex ^ ^ iei ^ 

respectively. 

For the independence and the edge models the marginal likelihood is given by 



/(n|G) = A-(„) n §fM 



(11) 



d£V{G) 

where 

r Yl o^diid 

DK[cxd) = 



n r(ad(^d) 
Finally, for the gamma structure the marginal likelihood f{n\G) is given by 

f{n\G) = K{n) [[ ^^L. . A [[ ^^ ~ ( • 

j-Gi- ^ V ' c/7 deV(G) ^ "■' 

where 



(12) 



DK{oL{-,i^) 






The posterior distribution of the marginal log-linear parameters A can be estimated 
in a straightforward manner using Monte Carlo samples from the posterior distribution of 
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Tv^. Specifically, a sample from the posterior distribution of A can be generated by the 
following steps. 

i) Generate a random sample 7r'^''*'(t = 1, . . . ,T) from the posterior distribution of 

ii) At each iteration t, calculate the the full table of probabilities tt^*^ from tv '^^' . 

iii) The vector of marginal log linear parameters, A ' ' ^ , can be easily obtained from 
TT*^*' via equation ([5]) which becomes 



AG.W=C«log(M«vec(7rW; 



where C and M are the contrast and marginalization matrices under graph G. 
Note that some elements of A will automatically be constrained to zero for all 
generated values due to the graphical structure of the model G and the way we 
calculate log-linear parameters using the previous equation. 

Finally, we can use the generated values (A ' ^*^ ; t = 1, 2, . . . , T) to estimate summaries of 
the posterior distribution /(A \G) or obtain plots fully describing this distribution. 

6 Illustrative examples 

The methodology described in the previous sections is now illustrated on two real data 
sets, a 2 X 2 X 2 and a 3 x 2 x 4 tables. In both example we compare the results obtained 
with our yardstick prior, the UlP-Perks' prior (a(i) = 1/|/|), with those obtained using 
Jeffrey's (a(i) = 1/2), Unit Expected Cell (a(i) = 1), and Empirical Bayes {a{i) = p{i)) 
priors. 
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6.1 A2x2x2 Table: Antitoxin Medication Data 

We consider a data set presented by Healy (1988) regarding a study on the relationship 
between patient condition (more or less severe), assumption of antitoxin (yes or not) and 
survival status (survived or not); see Table [5J In Table E] we compare posterior model 
probabilities under the four different prior set-ups. 

Table 2: Antitoxin data 



Survival (S) 



Condition (C) Antitoxin (A) No Yes 

More Severe Yes 15 6 

No 22 4 

Less Severe Yes 5 15 

No 7 5 



Under all prior assumptions the maximum a posteriori model (MAP) is SC+A (we 
omit the conventional crossing (*) operator between variables for simplicity), assuming 
the marginal independence of Antitoxin from the remaining two variables. 

Under Empirical Bayes and UlP-Perks' priors the posterior distribution is concentrate 
on the MAP model (it takes into account 93.4% and 91.7% respectively of the poste- 
rior model probabilities). The posterior distributions under the Jeffreys' and the unit 
expected prior set-ups are more disperse, supporting the three models (SC-I-A, AS-I-SC 
and ASC) with posterior weights higher than 10% and accounting around the 94% of the 
posterior model probabilities. Model AS + SC is also the model with the second highest 
posterior probability under UlP-Perks' prior but its weight is considerably lower than the 
corresponding probability of the MAP model. 
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Table 3: Posterior model probabilities (%) for the Antitoxin data. 













Model 












A+S+C 


AS+C 


AC+S 


SC+A 


AS+AC 


AS+SC 


AC+SC 


ASC 


Prior Distribution 




(1) 


(2) 


(3) 


(4) 


(5) 


(6) 


(7) 


(8) 


Jeffreys' a(i 


) = l/2 


0.3 


1.5 


0.2 


59.7 


0.1 


21.7 


3.0 


13.4 


Unit Expected Cell a{i 


) = 1 


0.2 


1.1 


0.2 


37.2 


0.1 


30.2 


4.7 


26.2 


Empirical Baycs a{i 


)=P« 


1.6 


2.4 


0.3 


93.4 


0.0 


1.7 


0.2 


0.4 


UlP-Perks' a(i 


) = 1/1^1 


1.2 


2.1 


0.3 


91.7 


0.0 


3.5 


0.4 


0.8 



Figure[2]presents boxplots summarizing 2.5%, 97.5% posterior percentiles and quantiles 
of the joint probabilities for the MAP model (SC+A) for the four prior set-ups. Since 
direct calculation from the posterior distribution is not feasible, we estimated the posterior 
summaries via Monte Carlo simulation (1000 values). From this figure, we observe minor 
differences between the posterior distributions obtained under the UlP-Perks' and the 
empirical Bayes prior. More differences are observed between Perks' UIP and the posterior 
distributions under the two other prior set-ups. Differences are higher for the first two cell 
probabilities, i.e. for 7r(l, 1, 1) and 7r(2, 1, 1). 

Similarly in Figure El we present boxplots providing posterior summaries for models 
SC + A, AS + SC and ASC under the UlP-Perks' prior set-up. The first two models 
are the ones with highest posterior probabilities and all of their summaries have been 
calculated using Monte Carlo simulation (1000 values). The saturated was used mainly as 
reference model since it is the only one for which the posterior distributions are available 
analytically. From the figure we observe that the posterior distributions on the joint 
probabilities vr of the full table are quite different highly depending on the assumed model 
structure. 

Finally, posterior summaries for the probability parameters it and the marginal log- 
linear parameters A for models SC + A, AS + SC and ASC (as described above) under 
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Figure 2: Antitoxin data: Boxplots summarizing 2.5%, 97.5% posterior percentiles and 
quantiles of the joint probabilities TTABcihJjk) for the MAP model (SC+A) for all prior 
set-ups (J=Jeffreys', U=Unit Expected Cell, E=Empirical Bayes, P=Perks') . 
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Figure 3: Antitoxin data: Boxplots summarizing 2.5%, 97.5% posterior percentiles and 
quantiles of the joint probabilities TTABcihJ, k) for models SC+A, AS+SC and ASC (4, 6 
and 8 respectively) under the UlP-Perks' prior set-up. 
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the UlP-Perks' prior are provided in Tables [J] and [5] respectively. All summaries of each 
element of tt'^ are obtained analytically based on the Beta distribution induced by the 
corresponding Dirichlet posterior distributions of iv^. Posterior summaries of A are esti- 
mated using the Monte Carlo strategy (1000 values) discussed in section[5j As commented 
in this section, some elements of A for graphs SC + A and AS + SC are constrained 
to zero due the way we have constructed our model. Hence for SC + A, the maximal 
interaction terms for the disconnected sets AS, AC and ASC, i.e. parameters Aa5(2,2), 
Aac(2, 2) and XascC^, 2, 2), are constrained to be zero for all generated observations. Sim- 
ilar is the picture for model AS + SC, but now only marginals AC and ABC correspond 
to disconnected sets implying that Aac(2,2) = Ay!is'c'(2, 2,2) = 0. 

6.2 A 3 X 2 X 4 table: Alcohol Data 

We now examine a well known data set presented by Knuiman and Speed (1988) regarding 
a small study held in Western Australia on the relationship between Alcohol intake (A), 
Obesity (O) and High blood pressure (H); see Table [6l 

In Table \7\ we report posterior model probabilities and corresponding Log-marginal 
likelihoods for each models. Under all prior set-ups the posterior model probability is 
concentrated on models H-|-A-|-0, HA-l-0 and HO-I-A. Empirical Bayes and UlP-Perks' 
support the independence model (with posterior model probability of 0.878 and 0.807 
respectively) whereas Jeffreys' and Unit Expected support a more complex structure, 
HO-I-A (with posterior model probability of 0.837 and 0.859 respectively). 

To save space we do not report here posterior summaries for model parameters, they 
can be found in a separate appendix on the web page: 

http : //stat-athens . aueb . gr/~ jbn/papers/paper21 . htm. 

7 Discussion and Final Comments 

In this paper we have dealt with the Bayesian analysis of graphical models of marginal asso- 
ciation for three way contingency tables. We have worked using the probability parameters 
of marginal tables required to fully specify each model. The proposed parametrization and 
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Table 4: Posterior summaries of model parameters for models SC + A, AS + SC and 
ASC in Antitoxin data under the UlP-Perks' prior set-up; a and h are the parameters of 
the resulted Beta marginal posterior distribution. 



Model 4: SC + A 








Beta Posterior 












Parameters 


Parameter 


Mean 


St.dev. 


Qo.025 


Qo.975 


a 


T> 


^sc(l,l) 


0.47 


0.055 


0.36 


0.57 


37.25 


42.75 


^sc(2,l) 


0.13 


0.037 


0.06 


0.21 


10.25 


69.75 


^5c(l,2) 


0.15 


0.040 


0.08 


0.24 


12.25 


67.75 


^a(I) 


0.52 


0.056 


0.41 


0.63 


41.50 


38.50 



Model 6: AS + SC 












Parameter 


Mean 


St.dev. 


Qo.mb 


Qo.975 


a 


6 


^5|AC(1|1,1) 


0.71 


0.096 


0.51 


0.88 


15.12 


6.12 


^5|Ac(l|2,l) 


0.84 


0.070 


0.68 


0.95 


22.12 


4.12 


^5|Ac(l|l,2) 


0.25 


0.094 


0.09 


0.46 


5.12 


15.12 


^s|Ac(l|2,2) 


0.58 


0.136 


0.31 


0.83 


7.12 


5.12 


^a(I) 


0.52 


0.056 


0.41 


0.63 


41.50 


38.50 


^c(l) 


0.59 


0.055 


0.48 


0.70 


47.50 


32.50 



Model 8: ASC (Satm-ated) 



Parameter 


Mean 


St.dev. 


Qo.025 


Qo.975 


a 


6 


41,1,1) 


0.19 


0.044 


0.11 


0.28 


15.12 


64.88 


^(2,1,1) 


0.28 


0.050 


0.18 


0.38 


22.12 


57.88 


^(1,2,1) 


0.08 


0.030 


0.03 


0.14 


6.12 


73.88 


^(2,2,1) 


0.05 


0.025 


0.01 


0.11 


4.12 


75.88 


^(1,1,2) 


0.06 


0.027 


0.02 


0.13 


5.12 


74.88 


^(2,1,2) 


0.09 


0.032 


0.04 


0.16 


7.12 


72.88 


^(1,2,2) 


0.19 


0.044 


0.11 


0.28 


15.12 


64.88 


7r(2,2,2) 


0.06 


0.027 


0.02 


0.13 


5.12 


74.88 
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Table 5: Antitoxin data: Posterior summaries for lambda for models SC-I-A, AS-I-SC and 
ASC under the UlP-Perks' prior set-up 



Model 4: 


SC + A 










Parametei 




Marginal table 


Mean 


St.dev. 


(30.025 


Qo.975 


Ag 




Mas 


-1.429 


0.032 


-1.513 


-1.388 


A^(2) 




Mas 


-0.040 


0.113 


-0.258 


0.181 


As(2) 




Mas 


-0.245 


0.118 


-0.480 


-0.021 


Aas(2,2) 




Mas 


0.000 


0.000 


0.000 


0.000 


Ac(2) 




Mac 


-0.194 


0.116 


-0.426 


0.027 


Aac(2,2) 




Mac 


0.000 


0.000 


0.000 


0.000 


Asc(2,2) 




Masc 


0.460 


0.134 


0.199 


0.735 


Aasc(2,2, 


2) 


Masc 


0.000 


0.000 


0.000 


0.000 



Model 6: AS + SC 



Parameter 


Marginal table 


Mean 


St.dev. 


(30.025 


Qo.975 


A0 


Mac 


-1.418 


0.025 


-1.483 


-1.388 


A^(2) 


Mac 


-0.042 


0.114 


-0.261 


0.173 


Ac(2) 


Mac 


-0.195 


0.110 


-0.414 


0.020 


Aac(2,2) 


Mac 


0.000 


0.000 


0.000 


0.000 


As(2) 


Masc 


-0.238 


0.137 


-0.493 


0.044 


Aas(2,2) 


Masc 


-0.291 


0.137 


-0.554 


-0.019 


Asc(2,2) 


Masc 


0.437 


0.137 


0.178 


0.712 


Aasc(2,2,2) 


Masc 


-0.086 


0.143 


-0.370 


0.207 



Model 8: ASC (Saturated) 



Parameter 


Marginal table 


Mean 


St.dev. 


(90.025 


(9o.975 


M 


Masc 


-2.325 


0.079 


-2.504 


-2.191 


\a{2) 


Masc 


-0.106 


0.134 


-0.379 


0.152 


As(2) 


Masc 


-0.246 


0.131 


-0.510 


0.004 


Aas(2,2) 


Masc 


-0.292 


0.139 


-0.576 


-0.033 


Ac(2) 


Masc 


-0.136 


0.143 


-0.402 


0.151 


Aac(2,2) 


Masc 


-0.084 


0.139 


-0.355 


0.202 


Asc(2,2) 


Masc 


0.450 


0.135 


0.207 


0.705 


Aasc(2,2,2) 


Masc 


-0.074 


0.143 


-0.368 


0.209 
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Table 6: Alcohol Data 



Alcohol intake 




High BP 


(drinks/days) 


Obesity 





1-2 


3-5 6+ 


Low 


Yes 


5 


9 


8 10 




No 


40 


36 


33 24 


Average 


Yes 


6 


9 


11 14 




No 


33 


23 


35 30 


High 


Yes 


9 


12 


19 19 




No 


24 


25 


28 29 



Table 7: Alcohol data: Posterior model probabilities and the corresponding Log marginal 
likelihoods; empty cell in posterior probabilities means that it is lower than 0.0001 



Posterior model probabilities (%) 



Prior Distribution 



Model 
H+A+O HA+O HO+A AO+H HA+HO HA+AO HO+AO HAO 

(1) (2) (3) (4) (5) (6) (7) (8) 



Jeffreys' 

Unit Expected Cell 

Empirical Bayes 

Perks' 



11.56 
6.91 

87.81 
80.67 



4.76 83.68 

7.21 85.88 

0.07 12.12 

0.15 19.18 



Log-marginal likelihood for each model 













Model 












H+A+O 


HA+O 


HO+A 


AO+H 


HA+HO 


HA+AO 


HO+AO 


HAO 


Prior Distribution 




(1) 


(2) 


(3) 


(4) 


(5) 


(6) 


(7) 


(8) 


Jeffreys (a(i) = 1/2) 




-79.22 


-80.11 


-77.24 


-87.73 


-90.44 


-100.93 


-98.06 


-98.95 


UEC (a(i) = 1) 




-78.51 


-78.47 


-75.99 


-84.70 


-85.27 


-93.99 


-91.51 


-91.46 


Emprirical Bayes {a{i' 


= p(i)) 


-86.96 


-94.10 


-88.94 


-107.26 


-124.75 


-143.06 


-137.91 


-145.04 


Perks (a{j) = 1/|/|) 




-86.90 


-93.19 


-88.33 


-107.10 


-121.13 


-139.89 


-135.03 


-141.33 
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the corresponding decomposition of the hkehhood simphfies the problem and automati- 
cahy imposes the marginal independences represented by the considered graph. By this 
way, the posterior model probabilities and the posterior distributions for the used param- 
eters can be calculated analytically. Moreover, the posterior distributions of the marginal 
log-linear parameters A and the probabilities tt of the full table can be easily obtained 
using simple Monte Carlo schemes. This approach avoids the problem of the inverse calcu- 
lation of 77 when the marginal association log-linear parameters A are available which can 
be only achieved via an iterative procedure; see Rudas and Bergsma (2004) and Lupparelli 
(2006) for more details. 

An obvious extension of this work is to implement the same approach in tables of 
higher dimension starting from four way tables. Although most of the models in a four 
way contingency table can be factorized and analyzed in a similar manner, two type of 
graphs (the 4-chain and the cordless four-cycle graphs) cannot be decomposed in the above 
way. These models are not Markov equivalent to any directed acyclic graph (DAG). In fact 
each bidirected graph (which corresponds to a marginal association model) is equivalent 
to a DAG, i.e. a conditional association model, with the same set of variables if and only 
if it does not contain any 4-chain, see Pearl and Wermuth (1994). We believe that also 
in higher dimensional problems our approach can be applied to bidirected graphs that 
admit a DAG representation. For the graph that do not factorize, more sophisticated 
techniques must be adopted in order to obtain the posterior distribution of interest and 
the corresponding marginal likelihood needed for the model comparison (work in progress 
by the authors). 

Another interesting subject is how to obtain the posterior distributions in the case that 
someone prefers to work directly with marginal log- linear parameters A defined by ([5]). 
Using our approach, we impose a prior distribution on the probability parameters tv^. The 
prior of A cannot be calculated analytically since we cannot have the inverse expression 
of ([5]) in closed form. Nevertheless, we can obtain a sample from the imposed prior on A 
using a simple Monte Carlo scheme. More specifically, we can generate random values of 
TV from the Dirichlet based prior set-ups described in this paper. We calculate the joint 
probability vector n according to the factorization of the graph under consideration and 
finally use ([5]) to obtain a sample from the imposed prior f{X\G). This will give us an 
idea of the prior imposed on the log- linear parameters. 

If prior information is expressed directly in terms of the log-linear parameters, see 
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e.g Knuiman and Speed (1988) and Dellaportas and Forster (1999), the prior and the 
corresponding posterior distribution of tt can be obtained using two alternative strategies. 

One possibihty is to approximate the distribution imposed on the elements of tt 
via Dirichlet distributions with the parameters obtained in the following way. Firstly we 
generate random values from the prior imposed on the standard log-linear parameters 
for models of conditional association. For each set of generated values, we calculate the 
corresponding probabilities tt for the full table. Finally we obtain a sample for vr*^ via 
marginalization from each set of generated probabilities tt. For every element of tt*^, we 
use the corresponding generated values to approximate the imposed prior by a Dirichlet 
distribution with the parameters estimated using the moment-matching approach. Note 
that this approach can only provide us a rough picture of the correct posterior distribution 
since the priors are only matched in terms of the mean and the variance while their shape 
can be totally different due to the properties of the Dirichlet distribution. 

Similar will be the approach if the prior distributions f{X\G) for the marginal log- 
linear parameters A are available. The only problem here, in comparison to the simpler 
approach described in the previous paragraph, is the calculation of tt from each A . In 
order to achieve that we need to use iterative procedures; see Rudas and Bergsma (2004) 
and Lupparelli (2006). 

A second approach is to directly calculate the prior distribution imposed on the prob- 
ability parameters tv'^ starting from the prior /(A |G) using equation ([5]). Note that the 
probabilities tt of the full table involved in ([5]) are simply a function of tv^ depending on 
the structure G. Hence, the prior on tv will be given by 

/(TT^IG) = /(A^IG) ^ 

where p is vec(7r*^) after removing the last element of each set of probability parameters 
and vec(7r'-') refers to vr*^ arranged in a vector form. The elements of the Jacobian are 
given by 



dX? 



k 



coliC) 

E 



''"i i.i 



-1 



a.|f:M,vec(.),) Em,'-^ 



where col{c) is the number of columns of C matrix. For the saturated model the above 
equation simplifies to 



^P^ tt EfllM^,^ec{7v), 
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since vec(7r)j = p^ for j < \2\ and vec(7r)|j| = 1 — X^'J;^ p9. After calculating the 
corresponding prior distribution f{7v'-'\G), we can work directly on tt'^ using an MCMC 
algorithm to generate values from the resulted posterior. A sample of A can be again 
obtained in a direct way using ([5]). When no strong prior information is available, an 
independence Metropolis algorithm can be applied using as a proposal the Dirichlet distri- 
butions resulted from the likelihood part. Otherwise more sophisticated techniques might 
be needed. The authors are also exploring the possibility to extend the current work in 
this direction. 
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Appendix 

1. Construction of Matrix M 

Let Ai = {Ml, M2, . . . , Mij^i} be the set of considered marginals. Let B be a binary 
matrix of dimension \M\ x |V| with elements Bi^ indicating whether a variable v 
belongs to a specific marginal Mj. The rows of B correspond to the marginals in 
A1 whereas the columns to the variables. The variables follow a reverse ordering, 
that is column 1 corresponds to variable -^jvh column 2 to variable Xiyix and so 
on. Matrix B has elements 

f 1 [fveMi 

I otherwise, 
for every f € V. 
The marginalization matrix M can be constructed using the following rules. 

(a) For each marginal Mi, the probability vector of the corresponding marginal ta- 
ble is given by Mjvr; where Mj is calculated as a Kronecker product of matrices 
A- 

Mi = (g)Ai„ 
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with 

h^ if Bi, = l 

ll if Bi, = 

where iy is the number of levels for v variable, Ii^ is the identity matrix of 

dimension i^ x i^ and 1^^ is a vector of dimension iy xl with all elements equal 

to one. 

(b) Matrix M is constructed by stacking all the Mj matrices 



M 



M,; 



y M|M| J 



2. Construction of Matrix C 

Firstly we need to construct the design matrix X for the saturated model corre- 
sponding to sum to zero constraints. It has has dimension I J^ ^v] ^ ( 11 ^^v] and 
can be obtained as 



with 



J^. (r, c) 



In matrix notation 



1 if c=l or r = c 
— 1 if r = 1 and c > 1 
otherwise. 

_-,T 

1(^,-1) I(^,-I)x{£„-1) 



where 1 



(^.-1) 



IS 



1) X 1 vector of ones while I(f^_i)x(£„-i) is an identity matrix 



of dimension {iy — 1) x (t^ — I). 

The contrast matrix C can be constructed by using the following rules. 

(a) For each margin Mi construct the design matrix Xi corresponding to the sat- 
urated model (using sum to zero constraints) and invert it to get the contrast 
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matrix for the saturated model Cj = X~^ . Let C* be a submatrix of Cj ob- 
tained by deleting rows not corresponding to elements of £Mi (the effects that 
we wish to estimate from margin Mi) . 

(b) The contrast matrix C is obtained by direct sum of the C* matrices as follow 

C= C* 

i: Mi>^M 

that is it is a block diagonal matrix with (CJ;Mj € A4) as the blocks. For 
example C* (^ C2 = ® j=i C* is the block diagonal matrix with Ci and C2 as 
blocks. 
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